The Twitter of Babel: Mapping World Languages through 
Microblogging Platforms 

Delia Mocanu 1, Andrea Baronchelli 1, Nicola Perra 1, Bruno Goncalves 2, Alessandro Vespignani 1,3,4

1 Laboratory for the Modeling of Biological and Socio-technical Systems, Northeastern 
University, Boston, MA 02115 USA 

2 Aix Marseille Universite, CNRS UMR 7332, CPT, 13288 Marseille, France 

3 Institute for Quantitative Social Sciences at Harvard University, Cambridge, MA 02138 
USA 

4 Institute for Scientific Interchange Foundation, Turin, Italy 

Abstract 

Large scale analysis and statistics of socio-technical systems that just a few short years ago would
have required the use of considerable economic and human resources can nowadays be conveniently
performed by mining the enormous amount of digital data produced by human activities. Although 
a characterization of several aspects of our societies is emerging from the data revolution, a number 
of questions concerning the reliability and the biases inherent to the big data "proxies" of social 
life are still open. Here, we survey worldwide linguistic indicators and trends through the analysis 
of a large-scale dataset of microblogging posts. We show that available data allow for the study of 
language geography at scales ranging from country-level aggregation to specific city neighborhoods. 
The high resolution and coverage of the data allow us to investigate different indicators such as
the linguistic homogeneity of different countries, the touristic seasonal patterns within countries and 
the geographical distribution of different languages in multilingual regions. This work highlights 
the potential of geolocalized studies of open data sources to improve current analysis and develop 
indicators for major social phenomena in specific communities. 



1 Introduction 

Modern life, with its increasing reliance on digital technologies, is opening unanticipated opportunities for 
the study of human behavior and large scale societal trends. Cell phones have been playing a pivotal role 
in this revolution, serving as ubiquitous sensors, and the default point of contact for online activities [1,2].
As a whole, mobile clients for microblogging platforms, social networking tools, and other "proxy" data 
of human activity collected in the web allow for the quantitative analysis of social systems at a scale that 
would have been unimaginable just a few years ago [3-6]. In particular, the possibility of using
mobile-enabled microblogging platforms, such as Twitter, as monitors of public opinion, social movements and
as tools for the mapping of social communities has generated much interest in the literature [7-14]. At
the same time it is crucial to understand to what extent the picture of socio-technical systems emerging
from digital data proxies is statistically sound and how well it scales to a planetary dimension [15].

In this paper, we perform a comprehensive survey of the worldwide linguistic landscape as emerging 
from mining the Twitter microblogging platform. Our large-scale dataset, gathered over approximately 
two years, at an average rate of 6.5 × 10^5 GPS-tagged tweets per day, contains information about almost
6 million users and provides a uniquely fine-grained survey of worldwide linguistic trends. By coupling 
the geographical layer to the identification of the language of single tweets we are able to determine the 
detailed language geography of more than 100 countries worldwide [16]. 

Although previous studies have investigated the language dynamics [17] of Twitter, those analyses have
focused on specific, yet interesting, aspects concerning the combined study of language and geographical
analysis in Twitter, and a global picture is still lacking. For instance, the most represented languages
have been identified for the top 10 most active countries [18], language-dependent differences have been
pointed out in user activity related to posting and conversation patterns [19], and language has
been shown to be a strong predictor for the formation of follower/followee relations [20]. For this reason
and for the sake of assessing the generality and planetary scalability of our analysis, we have first focused
on the reliability of geospatial trends extracted from our dataset. Interestingly, we find a universal pat- 
tern describing users' activity across countries, and a clear correlation between Twitter adoption and the 
Gross Domestic Product (GDP) of a country, further characterized by well defined continent-dependent 
trends. 

The high quality of the dataset permits the study of the spatial distribution of different languages 
at different scales, from aggregated country-level analysis to the neighborhood scale. In particular we
can drill down into the data of linguistic macro areas and single out heterogeneities at the country and regional
level, scrutinizing the cases offered by Belgium and Catalonia (Spain) as examples. Furthermore we
explore the resolution offered by the data at a very fine level of granularity and inspect the city and
neighborhood levels, taking as case studies the spatial distribution of French and English languages in 
Montreal (Canada) and inspecting linguistic majorities in New York City (USA). We find that Twitter is 
able to reproduce the geospatial adoption of languages for a wide range of resolution scales. We contrast 
our results against census data, and discuss the possible sources of discrepancies between the two. Finally, 
we broaden our perspective by addressing the seasonality patterns in the language composition of the 
Twitter signal. We use touristic countries such as Italy, Spain, and France to single out clear seasonal 
trends like, for instance, the increase of English and other languages during the summer holiday season. 
Overall, our analysis highlights the potential of Twitter data in defining open source indicators for 
geospatial trends at the planetary scale. 

The paper is structured as follows. In section 2 we go over data selection criteria as well as statistical
measures regarding the universality of user behavior. Within this framework, we investigate several
relevant examples in language geography (section 2.1) and explore the temporal dimension for seasonal
patterns (section 2.2). A discussion of the results (section 3) is followed by a thorough description of the
data sets and methodology used (section 4).



2 Results 

Our analysis is based upon Twitter data gathered in approximately 20 months between October 18, 2010
and May 17, 2012, at an average rate of 6.5 × 10^5 GPS-tagged tweets per day (see Table 1 for exact
numbers). The dataset includes 3.8 × 10^8 tweets produced by 6.0 × 10^6 users located in 191 countries, 110
of which generated the amount of data necessary for a significant statistical analysis of language detection.
Our language detection methods allowed us to identify 78 languages. Our analysis is restricted to
GPS-tagged tweets in order to preserve the maximum level of geographical detail, taking into account both live
GPS updates and device stored locations. The amount of geolocalized signal could in fact be increased by
considering different kinds of metadata, such as self-reported locations [13], but these procedures
would not allow us to reach the level of granularity and detail we aim for. Further details about the data
collection and analysis procedures, as well as on the (live) GPS metadata, can be found in the Methods 
section. Overall, considering the recent literature, and to the best of our knowledge, the amount of 
GPS-tagged data we have gathered is certainly remarkable not only in terms of volume, but also for the 
covered geographical and temporal extension. 

Fig. 1 illustrates the potential of inspection at different resolutions, from continent to city level,
highlighting the detailed structure that is visible at each scale. Countries are easily identified along with
their major metropolitan areas, and even within specific cities it is possible to observe a high degree of
detail. Coupling this geographical resolution with language detection tools (see Methods) provides us
with a remarkable view of how languages are used in different areas. However, Twitter adoption is not
homogeneous across different countries. Fig. 2 ranks countries in descending order in terms of Twitter
adoption, defined as the ratio between Twitter users and total population (i.e. Twitter users per 1,000
inhabitants). The emerging picture is highly heterogeneous, as expected, since our data come exclusively
from smartphone devices, whose use is tied to the availability of local infrastructure. In order
to support the hypothesis that economic diversity is a primary source of heterogeneity in Twitter
adoption (on mobile devices), we investigated whether the Gross Domestic Product (GDP) of a country
could serve as a predictor of microblogging adoption. Fig. 3 shows that this is the case, the GDP and the
Twitter users per capita being clearly correlated. Moreover, different continents (identified by different
color codes in Fig. 3) cluster together, indicating that cultural as well as socio-economic factors concur
in determining the observed pattern.
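
The adoption measure and its relation to GDP can be illustrated with a minimal sketch. The records below are made-up toy values for anonymous countries, not data from this study, and the choice of a Pearson coefficient on log-transformed values is an assumption made for illustration; the text only states that the two quantities are clearly correlated.

```python
import math


def users_per_thousand(users, population):
    """Twitter adoption as defined in the text: users per 1,000 inhabitants."""
    return 1000.0 * users / population


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical records: (label, users, population, GDP per capita in USD).
toy = [
    ("A", 500, 1_000_000, 2_000),
    ("B", 4_000, 2_000_000, 10_000),
    ("C", 30_000, 5_000_000, 40_000),
    ("D", 900, 3_000_000, 3_000),
]

adoption = [users_per_thousand(u, p) for _, u, p, _ in toy]
gdp = [g for *_, g in toy]

# Assumption: compare the two quantities on logarithmic scales.
r = pearson([math.log10(g) for g in gdp], [math.log10(a) for a in adoption])
print(adoption, round(r, 3))
```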

Geographical analyses at any scale require the aggregation of the signal produced by different users, 
and it is crucial to have a clear understanding of the patterns of single user activity. One might suspect 
that usage patterns at the individual level may show large heterogeneities across countries and thus cultures.
In order to test statistically for the presence of different usage patterns, we gather the number of tweets per
unit time sent by each identified user. From this data we construct the probability density function
p(N) that any given user emits N tweets per unit time. In our analysis we take one day as the
reference unit of time. Furthermore, the p(N) distribution can be analyzed by restricting the
statistical analysis to users belonging to a specific country, a specific language or both. Interestingly,
Fig. 4 shows that the distributions exhibit a universal shape irrespective of country (panel A),
language (panel B), or the weight of each country on specific languages (panel C). As we will see, this
finding is pivotal for an unbiased comparison of different geographical and linguistic scenarios. Any 
dependence of the activity distribution upon the language or location of the users would have reduced 
the array of possible analyses. It is worth stressing also that the curves overlap each other naturally,
i.e., with no need for any rescaling or transformation. Although this feature indicates a very strong 
statistical homogeneity at the population level, the observed distribution turns out to span almost 4 
orders of magnitude. The broad nature of this universal distribution is clear evidence of strong individual 
level heterogeneity. For this reason, in order to avoid distortions due to extremely active users, we 
consider only the proportion of tweets emitted by each user in a given language. Thus, a user i that
tweets in a set, L, of different languages, L = {A, B, C, . . . , Z}, will contribute to each language X a
fraction $N^i_X / \sum_{Y \in L} N^i_Y$, where $N^i_X$ is the total number of tweets written by user i in language X.
We adopt the same normalization also for the position of the user. The reasons for this normalization 
are multiple. First, the amount of tweets collected for each user ranges over several orders of magnitude. 
Very active users, as well as automatic bots, might therefore distort or mask the signal coming from 
"common" individuals. Second, tourism might be a strong source of noise when trying to understand 
the demographics of a country or of a city. Touristic locations in the South of France or Italy might for 
example exhibit a high proportion of tweets in English or German. 
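
The per-user normalization described above can be sketched in a few lines. This is an illustrative reading of the text, not the authors' code; the function names and the toy input are hypothetical. Each user contributes a total weight of one, split across languages in proportion to that user's own tweets.

```python
from collections import Counter, defaultdict


def language_fractions(tweet_langs):
    """Fraction of one user's tweets written in each language."""
    counts = Counter(tweet_langs)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def effective_users(tweets_by_user):
    """Sum the per-user fractions, so every user weighs exactly one in total."""
    totals = defaultdict(float)
    for langs in tweets_by_user.values():
        if not langs:
            continue  # a user without tweets contributes nothing
        for lang, frac in language_fractions(langs).items():
            totals[lang] += frac
    return dict(totals)


# Toy input: one very active English account and two ordinary users.
toy_users = {
    "u1": ["en"] * 1000,
    "u2": ["fr", "fr", "en", "fr"],
    "u3": ["fr"],
}
totals = effective_users(toy_users)
print(totals)
```

Counting raw tweets, English would dominate this toy sample by roughly 1001 to 4; after normalization the hyperactive account weighs no more than any other user.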

2.1 Language analysis at different geographic scales 

The ranking of languages in our signal is presented in Fig. 5, where the ordering is determined by the
number of users we observe for each one of them. As expected, English is largely dominant. Spanish
occupies the second position despite being almost 6 times less popular. Interestingly, these languages are
followed by Malay and Indonesian, reflecting the fact that Indonesia is a very active country in absolute
terms, even though in terms of users per capita the country is only ranked in the 30th position (see
Fig. 6). Here the effect of each country's population size becomes clear. A large country such as Indonesia
does not need a large per capita Twitter penetration to make its language very visible in Twitter, while
the much smaller Netherlands does. In fact, the Netherlands is the second country in terms of users per
capita (see Fig. 6), making Dutch the 8th most common language.

It is worth stressing that our statistics do not reflect the overall estimates of language speakers in 
the world. According to Ethnologue: Languages of the World [21], when native and secondary speakers
are considered together Standard Chinese leads the ranking (1.0 × 10^9 speakers), followed by English
(5.0 × 10^8 speakers), Spanish (3.9 × 10^8 speakers), Hindi (3.0 × 10^8 speakers) and Russian (2.5 × 10^8
speakers), with Malay/Indonesian ranked 8th (1.6 × 10^8 speakers). These discrepancies do not prevent
us from extracting meaningful information in countries where Twitter penetration is sufficiently high to serve as
an accurate mirror of the population, but they serve as a reminder that we are observing the worldwide
linguistic landscape through the lens of a (specific) microblogging platform which, for example, is not
available in China. Also the age composition of Twitter users must be taken into account if one is to
compensate for differences with respect to the official census data [22].



Country level. When we color each tweet according to its language and display them on a map we see 
immediately that most content produced within each country is written in its own dominant language (see 
Fig. 6A). This is further confirmed in Fig. 6B, which shows the extent to which the dominant language
prevails over other idioms in each country. In Figure 7 we plot, for each of the Top 20 countries (by
number of tweets), the fraction of users tweeting in each language. Interestingly, countries like France 
and Italy, which are characterized by a well defined and substantially homogeneous linguistic identity, 
emit more than 20% of their tweets in English and other languages. Since the most common language in 
Twitter is English, this is perhaps not surprising. It is in fact reasonable that even users in
non-English-speaking countries choose to tweet in English as a way of reaching out to a broader audience.

Regional level. To understand the geospatial heterogeneity of different linguistic backgrounds, we
drill down to small, within-country scales. It is interesting, for instance, to look at the spatial
distribution of the different languages in multilingual regions. Figure 8A illustrates the geographical
distribution of languages used in Belgium, where the North part of the country uses predominantly
Flemish, while in the South of the country the dominant language is (Walloon) French. Overall, Flemish
accounts for 36.3% of the users, while French is the language of 14.7% of the users within the country
borders, i.e. Dutch is 2.5 times more popular than French. Census data set the Dutch to French ratio
(as first languages) to 1.5 [23]. The result emerging from the Twitter analysis is qualitatively correct,
the quantitative mismatch being explained by the different Twitter penetration in neighboring France and
the Netherlands, whose dominant languages are, of course, French and Dutch, respectively. In the first case, the number of
users per 1,000 inhabitants is 0.85, while in the second it is 6.34, more than 7 times higher (see also Fig. 2).
The Dutch speaking population of Belgium finds itself embedded in a much richer Twitter environment, 
and consequently is more involved in the microblogging activity. 
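
As a quick arithmetic check of the two ratios quoted in this paragraph (all input values are taken from the text; only the divisions are spelled out here):

```latex
\[
\frac{\text{Dutch share}}{\text{French share}} = \frac{36.3\%}{14.7\%} \approx 2.5,
\qquad
\frac{\text{users per 1{,}000 (Netherlands)}}{\text{users per 1{,}000 (France)}} = \frac{6.34}{0.85} \approx 7.5 .
\]
```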

Moving to a within-country scale, Figure 8B shows the linguistic distribution in Catalonia, an
autonomous region of Spain. Here Catalan and Spanish are clearly intermixed (particularly in Barcelona),
even though Spanish is the most popular language, with a share of 49.0% of the users, whereas Catalan
represents 28.2% of the signal, making Spanish 1.7 times more popular than Catalan. Interestingly,
the Spanish to Catalan ratio is 1.25 when the habitual language of adults living in Catalonia is considered,
according to a survey performed in 2008 by the Institute of Statistics of Catalonia [24]. In this case the
Twitter data is close to the census data, although some considerations are in order. First, census data
do not take into account the presence of tourists, whose Twitter activity is on the other hand recorded.
Second, Twitter users may be biased towards the most common languages, in order to reach a wider
audience. This interpretation is corroborated by the fact that while in our dataset Catalan and Spanish
account for 77.2% of the users, they represent the habitual language of 93.5% of the population
according to the above mentioned survey. In the same way, English, which according to census data is
customarily spoken by less than 0.01% of the resident population, is adopted by 15.2% of the users. Going
to a deeper level of inspection, we see that the Catalan language is more widely used in the central and
Northern part of the region than in the area of Barcelona and the coast connecting this city to Tarragona.
Remarkably, this pattern agrees with the overall picture provided by census data [24], thus confirming
once again the validity of online data in providing meaningful information, even at the within-country
scale.
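
The regional maps rest on a per-cell comparison of two languages (600 m cells, according to the caption of Figure 8). A minimal sketch of such a computation follows. The flat-earth projection with a fixed reference latitude, the function names and the toy coordinates are all assumptions made for illustration; the paper does not describe its gridding procedure.

```python
import math
from collections import defaultdict

METERS_PER_DEGREE = 111_320.0  # approximate length of one degree of latitude


def cell_index(lat, lon, ref_lat, cell_m=600.0):
    """Map a coordinate to a square grid cell (equirectangular approximation)."""
    x = lon * METERS_PER_DEGREE * math.cos(math.radians(ref_lat))
    y = lat * METERS_PER_DEGREE
    return (math.floor(x / cell_m), math.floor(y / cell_m))


def cell_language_share(points, lang_a, lang_b, ref_lat, cell_m=600.0):
    """Share of lang_a among lang_a + lang_b in each cell.

    points: iterable of (lat, lon, language, weight), where weight is the
    user-normalized contribution of a user to that language and place.
    """
    weights = defaultdict(lambda: {lang_a: 0.0, lang_b: 0.0})
    for lat, lon, lang, w in points:
        if lang in (lang_a, lang_b):  # all other languages are ignored
            weights[cell_index(lat, lon, ref_lat, cell_m)][lang] += w
    return {
        cell: c[lang_a] / (c[lang_a] + c[lang_b])
        for cell, c in weights.items()
        if c[lang_a] + c[lang_b] > 0
    }


# Toy points (hypothetical weights): three in one cell, one far away.
toy_points = [
    (41.3851, 2.1750, "es", 0.5),
    (41.3852, 2.1751, "ca", 0.5),
    (41.3851, 2.1750, "es", 1.0),
    (41.9800, 2.8200, "ca", 1.0),
    (41.3851, 2.1750, "en", 1.0),  # ignored: not one of the two languages
]
shares = cell_language_share(toy_points, "es", "ca", ref_lat=41.5)
print(shares)
```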

City level. The high quality of the GPS geolocalized signal allows the inspection of the language
demographics of single cities. Figure 9 shows the city of Montreal, where English and French are the
most used languages. While English is significantly more popular (65.5% of users, vs. 26.9% for French),
there appears to be spatial segregation, with French being more popular in the northern neighborhoods.
Overall, English is 2.4 times more popular than French in our signal, while the situation is the opposite
according to census data surveying languages spoken at home, where French is 3.1 times more frequent
than English [25]. This reversal is not easy to interpret, but we speculate that the geographical location
of Montreal, the fact that we do not consider the entire metropolitan population, and the fact
that English is in general the privileged communication language in North America are factors that
might play an important role.

The same analysis can be performed at the level of city neighborhoods. In the case of New York
City, a city known for its cultural diversity, several non-English speaking communities are already
well-defined and documented [26-30]. For this case study, we partition NYC, Long Island, and New Jersey
state into districts, towns, and municipalities, respectively. We do not consider the signal in English
(since it is the official language, and homogeneously predominant in the area) and we focus instead on
the language exhibiting the second largest number of users inside each district/town. Some of the most
prominent communities are those of Spanish speakers in Harlem, Bronx, and parts of Queens [26]. However,
Spanish is shared by people from many different cultural backgrounds and it is also widely used across
the United States. It is thus difficult to estimate the exact location and dimensions of these communities
solely based on the Twitter signal. In fact, it is clear that Spanish dominates as a second language in a number
of districts in Figure 10. Remarkable, on the other hand, is the clear delimitation of other communities.
The Korean communities in Palisades Park, NJ and Flushing, NY are of considerable size and also very
socially active [27,28]. Marine Park, NY, on the other hand, has a long history of Dutch immigration
that dates back to the first European settlers in the area [29]. Another notable example is the case of
Coney Island, NY, which is home to the largest Russian community in the United States [30]. The high
resolution of our dataset allows us to visualize these communities without any a priori assumptions.
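
The neighborhood analysis above amounts to finding, for each district, the language with the most users once English is set aside. A minimal sketch follows, assuming that each user-normalized contribution has already been assigned to a district (the assignment of coordinates to district boundaries is not shown). The district labels and weights are hypothetical.

```python
from collections import defaultdict


def second_language_by_district(contributions, excluded="en"):
    """Most used language per district after dropping the excluded one.

    contributions: iterable of (district, language, user_weight).
    Districts where only the excluded language appears are omitted.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for district, lang, weight in contributions:
        if lang != excluded:
            totals[district][lang] += weight
    return {d: max(langs, key=langs.get) for d, langs in totals.items()}


# Toy, made-up contributions for three anonymous districts.
toy_contributions = [
    ("district_1", "en", 50.0),
    ("district_1", "es", 12.5),
    ("district_1", "ko", 3.0),
    ("district_2", "en", 40.0),
    ("district_2", "ko", 9.0),
    ("district_2", "es", 4.0),
    ("district_3", "en", 5.0),
]
second = second_language_by_district(toy_contributions)
print(second)
```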

2.2 Seasonal variations 

Now that we have a good characterization of the relative linguistic composition of each country, we can
assess the use of our data to study and analyze seasonal variations of language composition, as this would
give us valuable insights into population movements occurring over the course of a year. In particular, we
might expect that during more touristic seasons one could observe a relative decrease in traffic occurring in
the local dominant language and a corresponding increase in content being generated in foreign languages.
In Fig. 11 we show the relative contributions of minority languages from users within a given country as
a function of the month of the year. In particular we single out traditional touristic destinations, such as
France, Italy, and Spain, where clear variations are indeed visible during the summer.

Our analysis allows us not only to identify the aggregate touristic fluxes, but also to infer the regions
of origin on the basis of the observed language. Of course, the patterns we observe are likely to be slightly
biased by the specificity of our observation point, so that for example the contribution of Dutch is likely
to be constantly overestimated due to the high penetration of Twitter in the Netherlands. However, 
the possibility of observing seasonal fluxes is absolutely remarkable if we consider the low cost, both in 
terms of time and resources, that a Twitter survey requires, compared to more traditional approaches. 
Moreover, monitoring social networks allows us to gain a real-time perspective of the fluxes, which is of 
course extremely hard to achieve through demographic studies. 
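
The monthly language shares discussed in this section can be sketched as follows. This is a hedged illustration with hypothetical names and toy weights, not the authors' pipeline: for each month, user-normalized weights are summed per language and divided by the monthly total, and only the non-dominant languages are kept.

```python
from collections import defaultdict


def monthly_minority_shares(records, dominant):
    """Share of each non-dominant language per month.

    records: iterable of (month, language, user_weight), with month in 1..12.
    Shares are computed over all languages, the dominant one included.
    """
    by_month = defaultdict(lambda: defaultdict(float))
    for month, lang, weight in records:
        by_month[month][lang] += weight
    shares = {}
    for month, langs in by_month.items():
        total = sum(langs.values())
        shares[month] = {
            lang: w / total for lang, w in langs.items() if lang != dominant
        }
    return shares


# Toy weights for a country whose dominant language is labeled "it".
toy_records = [
    (1, "it", 9.0), (1, "en", 1.0),
    (8, "it", 6.0), (8, "en", 3.0), (8, "de", 1.0),
]
shares_by_month = monthly_minority_shares(toy_records, dominant="it")
print(shares_by_month)
```

In this toy input the English share rises from 0.1 in January to 0.3 in August, the kind of summer increase the text associates with tourism.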



3 Discussion 

In this paper we have characterized the worldwide linguistic geography as observed from the Twitter 
platform, aggregating microblogging data at different scales, from country level down to the neighborhood 
scale. Although we show that Twitter penetration is highly heterogeneous and closely correlated with 
GDP, we find that the statistical usage pattern of the microblogging platform turns out to be independent 
from such factors as country and language. This feature allows us to address different issues, such as 
linguistic homogeneity at the country level, the geographic distribution of different languages in bilingual
regions or cities, and the identification of linguistically specific urban communities. Focusing on specific
case studies, we have shown that Twitter trends mirror census data quite accurately, even though
specific deviations might emerge when comparing data that can be influenced by the adoption rate of the
microblogging platform or the fact that English is the most widely used language in Twitter. Finally, the
analysis of temporal variations of the language composition of a given country opens up the possibility
of identifying seasonal travel and mobility patterns in real time.
The presented results confirm the potential and opportunities offered by open access data, such as
microblogging posts, in the characterization and analysis of demographic and social phenomena.

4 Materials and Methods 
Data Collection 

The dataset was obtained by extracting tweets from the raw Twitter Gardenhose feed [31]. The Gardenhose
is an unbiased sample of 10% of all tweets and provides a statistically significant real-time
view of all activity within the Twitter ecosystem. Twitter has supported explicit geotagging
of tweets since November 2009, by providing API hooks that third party developers can use
to embed GPS coordinates within the metadata of each tweet. Since high quality GPS systems are
increasingly common in mobile devices, this feature immediately became popular with mobile application
developers and is currently available in hundreds of different Twitter clients. On average, about 1% of the
tweets contain GPS information.

4.1 Language Detection 

Automatically determining the language in which a certain text was written is a problem of great practical
importance for machine learning and data mining. Perhaps the best known example of this is a feature
in Google's popular web browser, Chrome, that offers to translate a page from its original language to
the user's preferred language. The library that detects the original language of the page leverages
Google's extensive experience with data mining. It has been extracted from the open source version of
Chrome, which is currently in use by millions of users around the world, and made available separately as the
"Chromium Compact Language Detector" [32]. To further ensure the accuracy of the result, we filter the
results by using an uncertainty threshold within the language detector.
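
The thresholding step can be illustrated generically. The sketch below does not call the Chromium Compact Language Detector: `toy_detector` is a deliberately naive, hypothetical stand-in, and both the (language, confidence) return shape and the threshold value are assumptions, since the text does not state the threshold actually used.

```python
from typing import Callable, Optional, Tuple

# Assumed detector interface: text -> (language code, confidence in [0, 1]).
Detector = Callable[[str], Tuple[str, float]]

# Tiny stop-word lists used only by the toy stand-in below.
STOPWORDS = {
    "en": {"the", "and", "is"},
    "fr": {"le", "la", "et", "est"},
}


def toy_detector(text: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a real detector: stop-word hit rate."""
    tokens = text.lower().split()
    if not tokens:
        return ("unknown", 0.0)
    scores = {
        lang: sum(tok in words for tok in tokens) / len(tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return (best, scores[best])


def detect_with_threshold(
    text: str, detector: Detector, min_confidence: float
) -> Optional[str]:
    """Return the detected language, or None if the detector is too unsure."""
    lang, confidence = detector(text)
    return lang if confidence >= min_confidence else None


confident = detect_with_threshold(
    "the cat is here and the dog is there", toy_detector, 0.5
)
uncertain = detect_with_threshold("bonjour tout le monde", toy_detector, 0.5)
print(confident, uncertain)
```

Tweets for which the function returns None would simply be left out of the language statistics.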

4.2 Geolocalization and Statistics 

We restrict our analysis to tweets containing GPS coordinates, i.e. generated by using a smartphone 
with an Internet connection. This choice allows for the maximum geographical resolution, but inevitably 
reduces the volume of available signal. In fact, the data we have used for this paper constitutes just about 
1% of the signal we have collected, which in turn is approximately 10% of the total Twitter volume.

The amount of geolocalized tweets could be increased by considering self-reported information. In
fact, users are encouraged to provide their location information in the user profile, but it is not subject to
any format restriction. Moreover, Twitter platforms do not prompt the user for an update of this field,
thus any change to this metadata field has to be spontaneous and made voluntarily. For this reason,
the information in the user profile is sometimes erroneous or has low granularity. While the research
community is on a continuous quest to understand how to mine and geocode this data, doing so brings
about many challenges [33]. Moreover, when addressing temporal variations in mobility patterns, the use
of smartphone GPS coordinates is required.

The metadata accompanying a tweet may also contain the geographical coordinates of a previous
location in the field of self-reported location. These 'historical' locations might bias statistical measures
involving mobility and/or fine graining, thus we considered them only in generating the language maps
(Belgium, Catalonia, NYC). All analyses performed at the country level make use solely of live-GPS
coordinates. We consider only those countries for which our signal is generated by at least 200 users,
normalized by their activity and location. So if a user emits 30% of her tweets from a given country she
will contribute as 0.3 users to that country. 110 countries satisfy this minimum user threshold.
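
The country-level normalization and the minimum user threshold can be sketched as follows. The function names and toy input are hypothetical; the rule itself (a user who emits 30% of her tweets from a country counts as 0.3 users there, and a country is kept if it reaches 200 normalized users) is taken from the paragraph above.

```python
from collections import Counter, defaultdict


def normalized_users_per_country(tweet_countries_by_user):
    """Split each user across countries in proportion to her tweets there."""
    totals = defaultdict(float)
    for countries in tweet_countries_by_user.values():
        counts = Counter(countries)
        n_tweets = sum(counts.values())
        for country, k in counts.items():
            totals[country] += k / n_tweets
    return dict(totals)


def countries_above_threshold(totals, min_users=200.0):
    """Countries whose normalized user count reaches the threshold."""
    return sorted(c for c, users in totals.items() if users >= min_users)


# Toy input: u1 tweets 30% from X and 70% from Y, u2 only from Y.
toy_locations = {
    "u1": ["X"] * 3 + ["Y"] * 7,
    "u2": ["Y"],
}
country_totals = normalized_users_per_country(toy_locations)
# A threshold of 1.0 is used here only because the toy sample is tiny.
kept = countries_above_threshold(country_totals, min_users=1.0)
print(country_totals, kept)
```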

Finally, it is crucial to stress that every statistical measure in this paper is computed
at the user level, in order to reduce the noise that bots or cyborgs might add to the analysis. If not
suitably addressed, in fact, their presence could induce wrong conclusions on the day-to-day behavior of
the average person [34].

6 Figures

Figure 1. Multiscale view of the geolocated Twitter signal. The large volume of geolocated
Twitter traffic allows for a high resolution characterization of human behavior. A) Europe B) Italy C)
Lazio region D) Rome. The squares highlight the zooming areas.

Figure 2. Ranking of countries by users per capita. Ranking of countries by average
number of Twitter users per 1,000 inhabitants.

Figure 3. Users and GDP per capita. Correlation between country level Twitter penetration and
GDP/capita.

Figure 4. User Activity. Probability density p(N) of user activity (number of daily tweets N)
grouped by country (A) and language (B), and by country while considering English tweets exclusively
(C). Different curves collapse naturally, without any functional rescaling, indicating the presence of a
seemingly universal distribution of user activity, independent of cultural background.

Figure 5. Languages by number of users. Languages ranked by total number of users. For
clarity, only languages with more than 30 users are shown.




Figure 6. Geographic distribution of languages around the world. A) Raw Twitter signal. 
Each color corresponds to a language. Densely populated areas are easily identified, while, as expected, 
languages are well separated among European countries. B) Dominant language usage. The color of 
each country indicates the fraction of users adopting the official language in tweets. Gray represents
countries without a statistically significant signal.

Figure 7. Language share of the most active countries. Language adopted by users coming
from Top 20 most active countries, ordered by number of English tweets.

Figure 8. Language polarization in Belgium and Catalonia, Spain. In each cell (600m
resolution) we compute the user-normalized ratio between the two languages being considered in each
case. A) Belgium. B) Catalonia. The color bar is labeled according to the relative dominance of the
language denoted by blue.

Figure 9. Language polarization in Montreal, QC, Canada. English and French are
considered. In each cell (200m x 200m) we compute the user-normalized ratio between English and
French (excluding all other languages). Blue - English, Yellow - French. The color bar is labeled
according to the relative dominance of English to French.

Figure 10. Language polarization in New York City, NY, USA. The second language by
district or municipality (in the case of New Jersey state) is shown. Blue - Spanish, Light Green -
Korean, Fuchsia - Russian, Red - Portuguese, Yellow - Japanese, Pink - Dutch, Grey - Danish, Coral -
Indonesian.

Figure 11. Monthly variations in language use. Fraction of minority languages in specific
countries as a function of the month. Increases in a specific language share indicate the presence of
tourists visiting the country. Peaks are clearly visible during the local summer period.

7 Tables



Days of data collection        564
Tweets/day GPS (live-GPS)      651,400 (128,385)
Users (users live-GPS)         5,962,976 (4,551,384)
Countries (total)              191
Countries (analyzed)           110
Detected languages             78

Table 1. Basic metrics of the data set. Along with the total GPS signal, the live-GPS portion is
reported in parentheses (see Methods for details).



